
136 research outputs found

A Convergence Theorem for the Graph Shift-type Algorithms

Authors: Cao Longbing, Fan Xuhui
Publication date: 12/06/2013

Graph Shift (GS) algorithms have recently attracted attention as a promising approach for discovering dense subgraphs in noisy data. However, no theoretical foundation exists for proving the convergence of GS algorithms. In this paper, we propose a generic theoretical framework consisting of three key GS components: the simplex of the generated sequence set, a monotonic and continuous objective function, and a closed mapping. We prove that GS algorithms with these components can be transformed to fit Zangwill's convergence theorem, and that the sequence generated by the GS procedure always terminates at a local maximum or, at worst, contains a subsequence that converges to a local maximum of the similarity measure function. The framework is verified by extending it to other GS-type algorithms and by experimental results.
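The abstract does not spell out the GS update itself. As a rough illustration of the kind of iteration it describes (a sequence confined to the simplex, with a monotonic objective), the sketch below runs plain replicator dynamics on a toy similarity matrix. This is an assumption chosen for illustration, not the paper's algorithm: the matrix `A`, the starting point `x0`, and the function names are all hypothetical, and the neighborhood-expansion part of graph shift is omitted.

```python
def replicator_step(A, x):
    """One replicator-dynamics update: x_i <- x_i * (A x)_i / (x^T A x)."""
    Ax = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    value = sum(x_i * ax_i for x_i, ax_i in zip(x, Ax))  # objective x^T A x
    return [x_i * ax_i / value for x_i, ax_i in zip(x, Ax)], value


def run_replicator(A, x, steps=50):
    """Iterate the update, recording the objective evaluated before each step."""
    history = []
    for _ in range(steps):
        x, value = replicator_step(A, x)
        history.append(value)
    return x, history


# Hypothetical toy data: vertices 0-2 form a dense triangle,
# vertex 3 is only weakly attached to it.
A = [
    [0.0, 1.0, 1.0, 0.1],
    [1.0, 0.0, 1.0, 0.1],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
x0 = [0.25, 0.25, 0.25, 0.25]
x_final, history = run_replicator(A, x0)
```

For a symmetric non-negative `A`, each step keeps `x` on the simplex and does not decrease `x^T A x`, which is the monotonicity property a Zangwill-style argument relies on. Here the weight on vertex 3 shrinks toward zero while the triangle keeps equal weights.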

arXiv.org e-Print Archive

In-Depth Behavior Understanding and Use: The Behavior Informatics Approach

Author: Cao Longbing
Publication venue: Elsevier BV
Publication date: 02/07/2020

The in-depth analysis of human behavior has been increasingly recognized as a crucial means for disclosing interior driving forces, causes and impact on businesses in handling many challenging issues. The modeling and analysis of behaviors in virtual organizations is an open area. Traditional behavior modeling mainly relies on qualitative methods from behavioral science and social science perspectives. The so-called behavior analysis is actually based on human demographic and business usage data, where behavior-oriented elements are hidden in routinely collected transactional data. As a result, it is ineffective or even impossible to deeply scrutinize native behavior intention, lifecycle and impact on complex problems and business issues. We propose the approach of Behavior Informatics (BI), in order to support explicit and quantitative behavior involvement through a conversion from source data to behavioral data, and further conduct genuine analysis of behavior patterns and impacts. BI consists of key components including behavior representation, behavioral data construction, behavior impact analysis, behavior pattern analysis, behavior simulation, and behavior presentation and behavior use. We discuss the concepts of behavior and an abstract behavioral model, as well as the research tasks, process and theoretical underpinnings of BI. Substantial experiments have shown that BI has the potential to greatly complement the existing empirical and specific means by finding deeper and more informative patterns leading to greater in-depth behavior understanding. BI creates new directions and means to enhance the quantitative, formal and systematic modeling and analysis of behaviors in both physical and virtual organizations

arXiv.org e-Print Archive

Asymptotic power of likelihood ratio tests for high dimensional data

Authors: Cao Longbing, Miao Baiqi, Wang Cheng
Publication date: 13/02/2013

This paper considers the asymptotic power of the likelihood ratio test (LRT) for the identity test when the dimension p is large compared to the sample size n. The asymptotic distribution of the LRT under alternatives is given, and an explicit expression for the power is derived. A simulation study is carried out to compare the LRT with other tests. These studies show that the LRT is a powerful test for detecting eigenvalues around zero. Key words and phrases: Covariance matrix, High dimensional data, Identity test, Likelihood ratio test, Power. Comment: 10 pages, 2 figures.
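The abstract does not write out the test statistic. For orientation, the classical normal-theory LRT statistic for the identity hypothesis is usually given in the form below. The notation is an assumption, not taken from the paper: $S_n$ is the sample covariance matrix, $\lambda_1, \dots, \lambda_p$ are its eigenvalues, and the paper may use a different normalization.

```latex
H_0 : \Sigma = I_p
\qquad
L_n \;=\; \operatorname{tr}(S_n) - \log\det(S_n) - p
      \;=\; \sum_{i=1}^{p} \bigl( \lambda_i - \log \lambda_i - 1 \bigr)
```

Each summand is non-negative and vanishes only at $\lambda_i = 1$. It grows without bound as $\lambda_i \to 0$, which fits the abstract's remark that the test is sensitive to eigenvalues around zero. The statistic is only defined when $S_n$ is nonsingular, which requires $p < n$.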

arXiv.org e-Print Archive

Non-parametric Power-law Data Clustering

Authors: Cao Longbing, Fan Xuhui, Zeng Yiling
Publication date: 12/06/2013

It has always been a great challenge for clustering algorithms to automatically determine the number of clusters according to the distribution of a dataset. Several approaches have been proposed to address this issue, including recent promising work that incorporates Bayesian nonparametrics into the k-means clustering procedure. This approach is simple to implement and theoretically sound, and it provides a feasible way to perform inference on large-scale datasets. However, several problems remain unsolved in this pioneering work, including applicability to power-law data, a mechanism for merging centers to avoid over-fitting, the clustering order problem, etc. To address these issues, Pitman-Yor Process based k-means (pyp-means) is proposed in this paper. Taking advantage of the Pitman-Yor Process, pyp-means treats clusters differently by dynamically and adaptively changing the threshold to guarantee the generation of power-law clustering results. In addition, a center agglomeration procedure is integrated into the implementation to merge small but close clusters and then adaptively determine the cluster number. With further discussion of the clustering order, the convergence proof, complexity analysis and an extension to spectral clustering, our approach is compared with traditional clustering algorithms and variational inference methods. The advantages and properties of pyp-means are validated by experiments on both synthetic and real-world datasets.
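As background for the threshold idea mentioned above, the sketch below shows a fixed-threshold variant of k-means in which a point opens a new cluster when it lies too far from every existing center. It illustrates only this baseline, under stated assumptions: the adaptive Pitman-Yor threshold and the center agglomeration step of pyp-means are not given in the abstract and are not reproduced here. The function names, the threshold `lam`, and the toy points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def threshold_kmeans(points, lam, max_iter=100):
    """k-means with a fixed threshold: a point whose squared distance to
    every center exceeds `lam` becomes the seed of a new cluster."""
    centers = [mean(points)]          # start from a single global center
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            if dists[best] > lam:
                centers.append(p)     # open a new cluster at this point
                best = len(centers) - 1
            if labels[i] != best:
                labels[i] = best
                changed = True
        # Drop clusters that lost all points, relabel, and recompute centers.
        kept = sorted(set(labels))
        remap = {old: new for new, old in enumerate(kept)}
        labels = [remap[l] for l in labels]
        centers = [mean([p for p, l in zip(points, labels) if l == k])
                   for k in range(len(kept))]
        if not changed:
            break
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = threshold_kmeans(points, lam=9.0)
```

With a fixed `lam`, the number of clusters is determined by the data rather than set in advance. The result depends on the order in which points are visited, which is the clustering order issue the abstract raises.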

arXiv.org e-Print Archive

Data Science: Nature and Pitfalls

Author: Cao Longbing
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 27/06/2020

Data science is creating very exciting trends as well as significant controversy. A critical matter for the healthy development of data science in its early stages is to deeply understand the nature of data and data science, and to discuss the various pitfalls. These important issues motivate the discussions in this article.

arXiv.org e-Print Archive

Data Science: Challenges and Directions

Author: Cao Longbing
Publication venue: Association for Computing Machinery (ACM)
Publication date: 27/06/2020

While data science has emerged as a contentious new scientific field, there has been enormous debate and discussion about why we need data science and what makes it a science. In reviewing hundreds of pieces of literature which include data science in their titles, we find that the majority of the discussions essentially concern statistics, data mining, machine learning, big data, or broadly data analytics, and only a limited number of new data-driven challenges and directions have been explored. In this paper, we explore the intrinsic challenges and directions inspired by comprehensively exploring the complexities and intelligence embedded in data science problems. We focus on the research and innovation challenges inspired by the nature of data science problems as complex systems, and the methodologies for handling such systems.

arXiv.org e-Print Archive

Coupling Learning of Complex Interactions

Author: Cao Longbing
Publication venue: Elsevier BV
Publication date: 01/07/2020

Complex applications such as big data analytics involve different forms of coupling relationships that reflect interactions between factors related to technical, business (domain-specific) and environmental (including socio-cultural and economic) aspects. There are diverse forms of couplings embedded in poorly structured and ill-structured data. Such couplings are ubiquitous, implicit and/or explicit, objective and/or subjective, heterogeneous and/or homogeneous, presenting complexities to existing learning systems in statistics, mathematics and computer sciences, such as typical dependency, association and correlation relationships. Modeling and learning such couplings is thus fundamental but challenging. This paper discusses the concept of coupling learning, focusing on the involvement of coupling relationships in learning systems. Coupling learning has great potential for building a deep understanding of the essence of business problems and handling challenges that have not been addressed well by existing learning theories and tools. This argument is verified by several case studies on coupling learning, including handling coupling in recommender systems, incorporating couplings into coupled clustering, coupling document clustering, coupled recommender algorithms and coupled behavior analysis for groups.

arXiv.org e-Print Archive

Data Science: A Comprehensive Overview

Author: Cao Longbing
Publication venue: Association for Computing Machinery (ACM)
Publication date: 30/06/2020

The twenty-first century has ushered in the age of big data and data economy, in which data DNA, which carries important knowledge, insights and potential, has become an intrinsic constituent of all data-based organisms. An appropriate understanding of data DNA and its organisms relies on the new field of data science and its keystone, analytics. Although it is widely debated whether big data is only hype and buzz, and data science is still in a very early phase, significant challenges and opportunities are emerging or have been inspired by the research, innovation, business, profession, and education of data science. This paper provides a comprehensive survey and tutorial of the fundamental aspects of data science: the evolution from data analysis to data science, the data science concepts, a big picture of the era of data science, the major challenges and directions in data innovation, the nature of data analytics, new industrialization and service opportunities in the data economy, the profession and competency of data education, and the future of data science. This article is the first in the field to draw a comprehensive big picture, in addition to offering rich observations, lessons and thinking about data science and analytics

arXiv.org e-Print Archive

Non-IID Recommender Systems: A Review and Framework of Recommendation Paradigm Shifting

Author: Cao Longbing
Publication venue: Elsevier BV
Publication date: 01/07/2020

While recommendation plays an increasingly critical role in our living, study, work, and entertainment, the recommendations we receive are often for irrelevant, duplicate, or uninteresting products and services. A critical reason for such bad recommendations lies in the intrinsic assumption that recommended users and items are independent and identically distributed (IID) in existing theories and systems. Another phenomenon is that, while tremendous efforts have been made to model specific aspects of users or items, the overall user and item characteristics and their non-IIDness have been overlooked. In this paper, the non-IID nature and characteristics of recommendation are discussed, followed by the non-IID theoretical framework in order to build a deep and comprehensive understanding of the intrinsic nature of recommendation problems, from the perspective of both couplings and heterogeneity. This non-IID recommendation research triggers the paradigm shift from IID to non-IID recommendation research and can hopefully deliver informed, relevant, personalized, and actionable recommendations. It creates exciting new directions and fundamental solutions to address various complexities including cold-start, sparse data-based, cross-domain, group-based, and shilling attack-related issues

arXiv.org e-Print Archive

Characterizing A Database of Sequential Behaviors with Latent Dirichlet Hidden Markov Models

Authors: Cao Longbing, Cao Wei, Fan Xuhui, Song Yin, Zhang Jian
Publication date: 24/05/2013

This paper proposes a generative model, the latent Dirichlet hidden Markov model (LDHMM), for characterizing a database of sequential behaviors (sequences). LDHMMs posit that each sequence is generated by an underlying Markov chain process, which is controlled by the corresponding parameters (i.e., the initial state vector, the transition matrix and the emission matrix). These sequence-level latent parameters are modeled as latent Dirichlet random variables and parameterized by a set of deterministic database-level hyper-parameters. In this way, we expect to model the sequences at two levels: the database level, through deterministic hyper-parameters, and the sequence level, through latent parameters. To learn the deterministic hyper-parameters and approximate the posteriors of the parameters in LDHMMs, we propose an iterative algorithm under the variational EM framework, which consists of E and M steps. We examine two different schemes for the framework, the fully-factorized and partially-factorized forms, based on different assumptions. We present empirical results of behavior modeling and sequence classification on three real-world data sets, and compare them to other related models. The experimental results show that the proposed LDHMMs produce better generalization performance in terms of log-likelihood and deliver competitive results on the sequence classification problem.
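The two-level generative story described above can be sketched in a few lines: draw the sequence-level HMM parameters from Dirichlet distributions governed by database-level hyper-parameters, then generate one sequence from the resulting HMM. This is a minimal illustration under assumptions, not the paper's implementation. The function names, the hyper-parameter values, and the choice of two hidden states and three symbols are all hypothetical, and the variational EM learning step is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_sequence(length, init_alpha, trans_alpha, emit_alpha, rng):
    """Sample sequence-level HMM parameters, then one sequence from that HMM."""
    pi = sample_dirichlet(init_alpha, rng)                    # initial state vector
    A = [sample_dirichlet(row, rng) for row in trans_alpha]   # transition matrix
    B = [sample_dirichlet(row, rng) for row in emit_alpha]    # emission matrix
    state = sample_categorical(pi, rng)
    observations = []
    for _ in range(length):
        observations.append(sample_categorical(B[state], rng))
        state = sample_categorical(A[state], rng)
    return (pi, A, B), observations


# Hypothetical database-level hyper-parameters: 2 hidden states, 3 symbols.
rng = random.Random(0)
params, observations = generate_sequence(
    length=10,
    init_alpha=[1.0, 1.0],
    trans_alpha=[[1.0, 1.0], [1.0, 1.0]],
    emit_alpha=[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    rng=rng,
)
```

Calling `generate_sequence` repeatedly with the same hyper-parameters gives every sequence its own `pi`, `A`, and `B`, which is the sense in which the database is modeled at two levels.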

arXiv.org e-Print Archive